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Abstract 

We propose in this work a new family of kernels for variable-length time series. Our 
work builds upon the vector autoregressive (VAR) model for multivariate stochastic pro- 
cesses: given a multivariate time series x, we consider the likelihood function p#(x) of 
different parameters 6 in the VAR model as features to describe x. To compare two time 
series x and x', we form the product of their features pe(x) • pe(x') which is integrated 
out w.r.t 6 using a matrix normal-inverse Wishart prior. Among other properties, this 
kernel can be easily computed when the dimension d of the time series is much larger 
than the lengths of the considered time series x and x'. It can also be generalized to time 
series taking values in arbitrary state spaces, as long as the state space itself is endowed 
with a kernel k. In that case, the kernel between x and x' is a a function of the Gram 
matrices produced by n on observations and subsequences of observations enumerated in 
x and x'. We describe a computationally efficient implementation of this generalization 
that uses low-rank matrix factorization techniques. These kernels are compared to other 
known kernels using a set of benchmark classification tasks carried out with support 
vector machines. 



1 Introduction 



Kernel methods Hofmann et al. . 20081 have proved useful to hand le and analyze struc- 
tured data. A non-exhaustive list of such data types includes images IChapelle et all Il999l 

Grauman and DarrellLl2005l.lCuturi and Fukumizil 2007 . Harchaoui and Bach l2*007l| . g raphs [Kashima et al. 



2003 



2002 
2004 



Ma he et al.l. 12009. IVishwanathan et al.l . 120081 . IShervashidze and Borgwardtl. 120091]. texts [Joachims! . 



Moschitti and Zanzottd. 12007 1 and strings on finite alphabet s Leslie et all 120021 . ICortes et al 



Vert et al.l . l2004l . lCuturi and Vertl . [2005LlSonnenburg et all 120071 ] . which have all drawn 



much attention in recent years. Time series, although ubiquitous in science and engineering, 
have been comparatively the subject of less research in the kernel literature. 

Num erous similarity measures and distances for time series have been proposed in the past 
decades Schreiber and Schmitzl . ll997l ]. These similarities are not, however, always well suited 
to the k ernel methods framework. First , most available similarity measures are not positive 
definite Haasdonk and Bahlmannl . 120041 ] . Likewise, most distances are not negative definite. 
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The positive definiteness of similarity measures (alternatively the negative definiteness of 
distances) is needed to use the convex optimization algorithms that underly most kernel ma- 
chines. Positive definiteness is also the cornerstone of the reproducing kernel Hilber t spac e 
(RKHS) framework which supports these techniques Berlinet and Thomas- Agnanl . 120031 ] . 
Second, most similarities measures are only defined for 'standard' multivariate time series, 
that is time series of finite dimensional vectors. Yet, some of the main application fields 
of kernel methods include bioinformatics, natural language processing and computer vision, 
where the analysis of time series of structured objects (images, texts, graphs) remains a very 
promising field of study. Ideally, a useful kernel for time series should be bot h positive definite 



and able to handle time series of structured data. An oft-quoted example Bahlmann et al. 



20021 . IShimodaira et all 120021 ] of a no n-positive definite simila rity for time series is the Dy- 
namic Time Warping (DTW) score Sakoei and Chibal . Il978|. arguably the most p opular 
similarity score for variable-length multivariate time series Rabiner and Juang . 19931 . §4.7]. 
Hence the DTW can only be used in a kernel machine if it is altered through ad hoc modifica- 

' 20ld ]. Some extensions of the DTW score 



tions such as diagonal regu larizations [Zhou et al. 



have addressed this issue: lHayashi et al.1 20051 ] propose to embed time series in Euclidean 



spaces such that the distance of suc h representations approximates the distance induced by 
the DTW score. ICuturi et al.1 20071 ] consider the soft-max of the alignment scores of all pos- 
sible alignments to compute a positive definite kernel for two time series. This kernel can 
also be used on two time series x = {x\, ■ ■ ■ ,x n ) and x' = (x[, ■ ■ ■ ,x' n ,) of structured ob- 
jects since the kernel between x and x' can be expressed as a function of the Gram matrix 



/C 



i<n,j<n' 



where k is a given kernel on the structured objects of interest. 



A few alternative kernels have been proposed for multivariate time series. iKumara et al 



2008] consider a non-parametric approach to interpolate time series using splines, and define 



directly kernels on these interp olated representati ons. In a paragraph of their broad work 



on probability product kernels, iJebara et al.l 20041 . §4.5] briefly mention the idea of using 



the Bhattacharyya distan ce on suitable representation s as norma l densities of two t ime s eries 



using state-space models. IVishwanathan et al.l 120071] as well a s iBorewardt et al.l [20061 ] use 



the family of Binet-Cauchy kernels Vishwanathan and Smolal . 120041 ] . originally defined by 
the coefficients of the characteristic polynomial of kernel matrices such as the matrix K, 
described above (when n = n'). Unlike other techniques listed above, these two proposals 
rely on a probabilistic modeling of the time series to define a kernel. Namely, in both the 
probability product and Binet-Cauchy approaches the kernel value is the result of a two step 
computation: each time series is first mapped onto a set of parameters that summarizes their 
dynamic behavior; the kernel is then defined as a kernel between these two sets of parameters. 

The kernels we propose in this paper also rely on a probabilistic modeling of time series 
to define kernels but do away with the two step approach detailed above. This distinction is 
discussed later in the paper in Remark [2] Ou r contribution builds upon the the covariance 
kernel fra mework propose d by 
the work of Haussler |l999| ] and 



Seegerl 1200211 . and w hose approach can be traced back to 



Jaakkola et al.l 19991 ]. who advocate the use of probabilistic 



models to extract features from structured objects. Given a measurable space X and a model, 
that is a parameterized family of probability distributions on X of the form {pe,9 G 0}, a 
kernel for two objects x,x' G X can be defined as 



fc(x,x') 



where u is a prior on the parameter space. Our work can be broadly characterized as the 
implementation of this idea for time series data, using the VAR model to define the space 
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of densities pg and a specific prior for u, the matrix-normal inverse- Wishart prior. This 
conjugate prior allows us to obtain a closed-form expression for the kernel which admits 
useful properties. 

The rest of this paper is organized as follows. Section [2] starts with a brief review of the 
Bayesian approach to dynamic linear modeling for multivariate stochastic processes, which 
provides the main tool to define autoregressive kernels on multivariate time series. We follow 
by detailing a few of the appealing properties of autoregressive kernels for multivariate time 
series, namely their infinite divisibility and their ability to handle high-dimensional time 
series of short length. We show in Section [3] that autoregressive kernels can not only be 
used on multivariate time series but also on time series taking values in any set endowed 
with a kernel. The kernel is then computed as a function of the Gram matrices of subsets of 
shorter time series found in x and x'. This computation requires itself the computation of 
large Gram matrices in addition to a large number of operations that grows cubicly with the 
lengths of x and x'. We propose in Section [3.21 to circumvent this computational burden by 
using low-rank matrix factorization of these Gram matrices. We present in Section [3] different 
experimental results using toy and real-life datasets. 



2 Autoregressive Kernels 

We introduce in this section the crux of our contribution, that is a family of kernels that 
can handle variable-length multivariate time series. Sections 12.11 12.21 and 12.31 detail the 
construction of such kernels, while Sections 12.41 and 12.51 highlight some of their properties. 



2.1 Autoregressive Kernels as an Instance of Covariance Kernels 

A vector autoregressive model of order p henceforth abbreviated as VAR(p) is a family of 
densities for Revalued stochastic processes parameterized by p matrices Ai of size dxd 
and a positive definite matrix V. Given a parameter 9 = (Ai,--- ,A p ,V) the conditional 
probability density that an observed time series x = (x\,X2, ■ ■ ■ ,x n ) has been drawn from 
model 6 given the p first observations (x%, . . . , x p ), assuming p < n, is equal to 



j x p) 



(2tt\V\ 



d(n — p) 



n ex p 

=p+i 



i=l 



where for a vector x and a positive definite matrix V the Mahalanobis norm \\x\\y is equal 

to x T y- ' 



x. 



We write |x| for the length of the time series x, n in the example above. We 
abbreviate the conditional probability p#(x|xi, • • • , x p ) as pg(x) and take for granted that the 
p first observations of the time series are not taken into account to compute the probability 
of x. We consider the set of parameters 



e 



ndxd 



ixd 



where is the cone of positive definite matrices of size d X d, to define a kernel k that 
computes a weighted sum of the product of features of two time series x and x' as 9 varies 
in 9, 



fc(x,x') 



P0(x) c (l x l) P 0(x') c(|x,|) u;(d0), 



(1) 
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where the exponential weight factor c(|x|) is used to normalize the probabilities pe(x) by the 
length of the considered time series x. 



Remark 1. The feature map x — > {pe( x )}e g e P r °duces strictly positive features. Features 
will be naturally closer to for longer time series. We follow in this work the common practice 
in the kernel methods literature, as well as the signal processing literature, to normalize to 
some extent these features so that their magnitude is independent of the size of the input 
object, namely the length of the input time series. We propose to do so by normalizing the 
probabilities pe( x ) by the lengths of the considered series. The weight c(|x|) introduced in 
Equation [7] is defined to this effect in section \K_ 



Remark 2. The formulation of Equation (fTJ) is somehow orthogonal to kernels that map 
structured objects, in this case time series, to a single density in a given model and com- 
pare dir ectly these d e nsitie s using a kernel between densities, such as probabi lity product 



kernels \Hein and Bousquei 



kernels JJebara et all \200A l , Bin et-Cauchy kernels IVishwanathan et al\. \2007{]. structural 



200d] or information diffusion kernels 'Lebanon . 2006 J. Indeed, 
such kernels rely first on the estimation of a parameter # x in a parameterized model class 
to summarize the properties of an object x, and then compare two objects x and x' by using 
a proxy kernel on # x and X /. These approaches do require that the model class, the VAR 
model in the context of this paper, properly models the considered objects, namely that x is 
well summarized by # x . The kernels we propose do not suffer from this key restriction: the 
VAR model considered in this work is never used to infer likely parameters for a time series 
x but is used instead to generate an infinite family of features. 

The kernel k is mainly defined by the prior u on the parameter space. We present in 
the next section a possible choice for this prior, the matrix-normal inverse- Wishart prior, 
which has been extensively studied in the framework of Bayesian linear regression applied to 
dynamic linear models. 



2.2 The Matrix-Normal Inverse- Wishart Prior in the Bayesian Linear Re- 
gression Framework 

Consider the regression model between a vector of explanatory variables x of R m , and an 
output variable y E M. d , y = Ax + e, where A is a d x m coefficient matrix and e is a centered 
Gaussian noise with d x d covariance matrix V. Accordingly, y follows conditionally to x a 
normal density 

y\(x,A,V) ~J\f(Ax,V). 

Given n pairs of explanatory and response vectors (xi,yi) 1<i<n weighted by n nonnegative 
coefficients t = (tx, . . . , t n ) > such that Yli=i ^ = 1> ^ ne weighted likelihood of this sample 
of n observations is defined by the expression 

n 

p(Y\X,A,V,A) =Hp(y J \x j ,A,V) t ^ 

i=l 

1 ( 1 



|27rW 2 'V 2 
1 

\2ttV 



exp — tr A(Y - AX) 1 V L (Y - AX) , 



1 exp (-± tr V~\Y - AX)A(Y - AX) T ^j . 
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where the matrices A, Y and X stand respectively for diag(ii, . . . , t n ) £ R nxn , [yi, ■ ■ • , y n ] G 
R dXn and [xi,...,x n ] G R mXn . 



The matrix-normal inverse Wishart joint distribution IWest and Harrison! 19971 . §16.4.3] 
is a natural choice to model the randomness for (A,V). The prior assumes that the rfx m 
matrix A is distributed following a centered matrix-normal density with left variance matrix 
parameter V and right variance matrix parameter O, 

p(A) = MM(A;0,V,K) = ex P (-- tr A T V^MT^ 

where !] is a m x ra positive definite matrix. Using the following notations, 

SxX = XAX^ + Q, ^ j Sy X = Y AX^ , Syy = Y AY^ , ^y\x = Syy ^yxS^ S^, . 

we can integrate out A in p(Y\X, A, V, A) to obtain 



(2) 



The matrix-normal inverse Wishart prior for {A, V) also assumes that V is distributed with 
inverse- Wishart density W^E) of inverse scale matrix £ and degrees of freedom A > 0. 
The posterior obtained by multiplying this prior by Equation ([2]) is itself proportional to an 
inverse- Wishart density with parameters W _1 (S + S y t x , 1 + A) which can be integrated to 
obtain the marginal weighted likelihood, 

o(y\x) - TT r(M ^ } 1 l s l A/2 



_> 



Using for £ the prior 1^, for the matrix I m , and discarding all constants independent of X 
and Y yields the expression 

p{Y\X) oc 



|XAX r + J m | d / 2 |Y(A - AX T (XAX T + I™)" 1 XA)Y T + / d 



1+A 



Note that the matrix = f AX T (XAX T + I m ) 1 XA in the denominator is known as the 
hat-matrix of the orthogonal least-squares regression of Y versus X. The right term in the 
denominator can be interpreted as the determinant of the weighted cross-covariance matrix 
of Y with the residues (A — Ha)Y regularized by the identity matrix. 



2.3 Bayesian Averaging over VAR Models 



Given a VAR(p) model, we represent a time series x = {x\ , • • • , x n ) as a sample X of n—p pairs 
of explanatory variables in M. pd and response variables in M. d , namely {([xj, • • • , Xp+j-i], Xp+j) , i 
1, • • , n — p}. Following a standard practice in the study of VAR models [Liitkepohll . 120051 . 
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§3], this set is better summarized by matrices 



X 



•f'l 



^n—p+1 



G RP dxn ~P, and Y 



Xpjf-l • • • x n 



ndxn—p 



Analogously, we use the corresponding notations X',Y' for a second time series x' of 
length n' . Using the notation 

N d =n + n' - 2p, 

these samples are aggregated in the M, NxN , W dxN and 



IxN 



matrices 



A = diag (-[— , ■ 

° V 9 In—p ' 



n—p 7 n'—p ' 



, W L-]),X=[XX'],Y=[YY>]. 



(3) 



n—p times 



n'—p times 



Note that by setting A = [A\, ■ ■ ■ ,A P ] and 9 = (A,V), the integrand that appears in 
Equation ([T]) can be cast as the following probability, 

W (x)^ w (x')^ =p(Y\X,A,V,A). 

Integrating out 9 using the matrix-normal inverse Wishart prior AiW^ 1 (Id, I p d) for (^4-,^) 
yields the following definition: 

Definition 1. Given two time series x, a/ and using the notations introduced in Equation ([3]) ; 
the autoregressive kernel k of order p and degrees of freedom A vis defined as 



fe(x,x') 



1 



|XAX r + I pd \2 |Y(A - AX r (XAX T + I^)" 1 XA)Y T + I d 



l+A 
2 



(4) 



2.4 Variance and Gram Based Formulations 

We show in this section that k can be reformulated in terms of Gram matrices of subsequences 
of x and x' rather than variance-covariance matrices. For two square matrices C G M qxq and 
D G M. rxr we write C ^ D when the spectrums of C and D coincide, taking into account 
multiplicity, except for the value 0. Recall first the following trivial lemma. 

Lemma 1. For two matrices A and B in ~R qxr and K rx<? respectively, AB ^ BA, and as a 
consequence \AB + I q \ = \BA + I r \ 

Based on this lemma, it is possible to establish the following result. 

Proposition 1. Let a== then the autogressive kernel k of order p and degrees of freedom 
A given in Equation ([!]) is equal to 



k(x,x') = (IX^A + Ijvl 1 -" |X T XA + Y^A + I^n 



(5) 



6 



Proof. We use Lemma Q] to rewrite the first term of the denominator of Equation (|U) using 
the Gram matrix X r X, 

|XAX T + I pd \ = |X T XA + I N \. 
Taking a closer look at the denominator, the matrix inversion lemma 1 yields the equality 



(A — AX J (XAX T + I p d) ^ XA) 
Using again Lemma [I] the denominator of Equation d3|) can be reformulated as 

|X T X + A" 1 + Y T Y| 



(X T X + A -1 ) -1 . 



Y(A - AX T (XAX T + I pd ) 1 XA)Y T + I d 



Factoring in these two results, we obtain Equation ([5]) 



|X T X + A _1 | 



We call Equation the Variance formulation and Equation ([5]) the Gram formulation 
of the autoregressive kernel k as it only depends on the Gram matrices of X and Y. Although 
both the Variance and Gram formulations of k are equal, their computational cost is different 
as detailed in the remark below. 

Remark 3. In a high N-low d setting, the computation of k requires 0{N(pd) 2 ) operations 
to compute the denominator's matrices and 0(p 3 d 3 + d 3 ) to compute their inverse, which 
yields an overall cost of the order of 0(Np 2 d 2 + (p 3 + l)d 3 ). This may seem reasonable for 
applications where the cumulated time series lengths ' N is much larger than the dimension d 
of these time series, such as speech signals or EEG data. In a low N-high d setting, which 
frequently appears in bioinformatics or video processing applications, autoregressive kernels 
can be computed using Equation (J5J) in 0((p+ l)dN 2 + N 3 ) operations. 

2.5 Infinite Divisibility of Autoregressive Kernels 

We recall that a positiv e definite kernel f unction k is infinitely divisible if for all n£l, k 1 ^ 1 



is also positive definite Berg et all 11984 . §3.2.6]. We prove in this section that under certain 
conditions on A, the degrees-of- freedom parameter of the i nverse Wishart law , the autore- 
gressive kernel is infinitely divisible. This result builds upon Cuturi et al.l . 120051 . Proposition 
3]- 

Proving the infi nite divisibi l ity of k is useful for the following two reasons: First, following 



a well-known result iBerg et a 
definiteness of — log k. Using 
of Af w onto a Hilbert space such that 



19841. S3. 2.7], t he infinite divisibility of k implies the negative 



Berg et al.l 19841 . §3.3.2] for instance, there exists a mapping <I> 



log fc(x,x) + log fe(x',x') 



log fc(x,x') 



and hence k defines a Hilbertian metric for time series which can be used with distance-based 
tools such as nearest-neighbors. 

Second, on a more practical note, the exponent d/2 in Equation ([5]) is numerically prob- 
lematic when d is large. In such a situation, the kernel matrices produced by k would be 
diagonally dominant. This is analogous to selecting a bandwidth parameter a which is too 
small when using the Gaussian kernel on high-dimensional data. By proving the infinite 
divisibility of k, the exponent d can be removed and substituted by any arbitrary exponent. 



1 (A + ucvy 



A~ X U (C- 1 + VA^uy 1 VA~ 
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To establish the infinite divisibility result, we need a few additional notation. Let «M(M ) 
be the set of positive m easures on R rf wit h finite second-moment p[xx T ] = [xx T ] £ R dxd . 
This set is a semigroup B erg et alj . ll984l | when endowed with the usual addition of measures. 



Lemma 2. For two measures \i and p! of M(R. ), the kernel 

t : (p, p!) ^ 



y/\(n + H')[xx T ] +I d \ 
is an infinitely divisible positive definite kernel. 

Proof. The following identity is valid for any d x d positive-definite matrix £ 

1 1 r -hv T P+I d )y 



e~2 y ^ +ld ' v dy 



Given a measure p with finite second- moment on M. d , we thus have 

1 



^/\p[x X T ]+I d \ 



\{ l x[xx T ],yy T ) pAf{QJd) (dy) (6) 



where ( , ) stands for the Frobenius dot-product bet ween matrices and pa/yo r d ) is the standard 
multivariate normal density. In the formalism of Berg et al. . 19841 ] the integral of Equa- 
tion (|6|) is an integral of bounded semicharacters on the semigroup (M(R d ),+) equipped 
with autoinvolution. Each semicharacter p y is indexed by a vector y € R d as 



To verify that p y is a semicharacter notice that ^(0) = 1, where is the zero measure, 
and p y (p + //) = p y (p) Py(p') for two measures of Ai(W i ). Now, using the fact that the 
multivariate normal density is a stable distribution, one has that for any t 6 IN, 



yJ\p[xX T ] + I d \ 



m™ T \>yv T ) PmMt) (dy) 



e 24 



where p® is the t-th convolution of density p, which proves the result. ■ 

Theorem 2. For < a < 1, equivalently for < A < d — 1, ip = f — llogfc is a negative 
definite kernel 

Proof, k is infinitely divisible as the product of two infinitely divisible kernels, T d ( 1-a ) and 
r da computed on two different representations of x and x': first as empirical measures on 
MP d with locations enumerated in the columns of X and X' respectively, and as empirical 
measures on with locations enumerated in the columns of the stacked matrices [X; Y] 

and [X';Y']. The set of weights for both representations are the uniform weights 2(n-p) 
and 2 (n/-p) • ^ ne ne g a ti ve definiteness of cp follows from, and is equivalent to, the infinite 
divisibility of fc lBerg et"afl (l984l . §3.2.7]. ■ 
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Remark 4. The infinite divisibility of the joint distribution matrix normal-inverse Wishart 
distribution would be a s uffici ent condition to obtain directly the infinite divisibility of k using 
for instance Wera et al\ 11984 - §5.5.7/. Unfortunately we have neither been able to prove this 
property nor found it in the literature. The inver s e Wis hart distribution alone is known to 
be not infinitely divisible in the general case Levti , 194& 1- We do not know either whether k 
can be proved to be infinitely divisible when A > d — 1. The condition < A < d — 1, and 
hence < a < 1 also plays an important role in Proposition^ 



In the next sections, we will usually refer to the (negative definite) autoregressive kernel 



y>(x, x') = CV + (1 - a) log |X T X + A" 



+ a log |X T X + Y T Y + A" 1 1 



(7) 



where the constant C nn i 
k itself. 



(n—p) log(2(n—p)) + (n'—p) log(2(n'— p)), rather than considering 



3 Extension to Time Series Valued in a Set Endowed with a 
Kernel 

We show in Section that autoregressive kernels can be extended quite simply to time series 
valued in arbitrary spaces by considering Gram matrices. This extension is interesting but 
can be very computationally expensive. We propose a way to mitigate this computational 
cost by using low-rank matrix factorization techniques in Section 13.21 

3.1 Autoregressive Kernels Defined Through Arbitrary Gram Matrices 

Using again notation introduced in Equation ([3]), we write Kx = X r X for the N x N Gram 
matrix of all explanatory variables contained in the joint sample X and Ky = Y T Y for the 
Gram matrix of all outputs of the local regression formulations. As stated in Equation ([7]) 
above, ip and k by extension can be defined as a function of the Gram matrices Kx and Ky- 
To alleviate notations, we introduce two functions g and / defined respectively on the cone 
of positive semidefinite matrices and on (S^) 2 : 

g: Q^loglQ + A- 1 !, / : (Q,R) h- (1 - a)g{Q) + ag(R). (8) 

Using these notations, we have 

<p(x, x') = C n>n / + f(Kx, Kx + K Y ), 

which highlights the connection between (p(x, x') and the Gram matrices Kx and Kx + Ky- 
In the context of kernel methods, the natural question brought forward by this reformulation 
is whether the linear dot-product matrices Kx and Ky in (p or k can be replaced by arbitrary 
kernel matrices Kx and Ky between the vectors in M. pd and M. d enumerated in X and Y, 
and the resulting quantity still be a valid positive definite kernel between x and x'. More 
generally, suppose that x and x' are time series of structured objects, graphs for instance. In 
such a case, can Equation (|7|) be used to define a kernel between time series of graphs x and 
x' by using directly Gram matrices that measure the similarities between graphs observed in 
x and x'? We prove here this is possible. 
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Let us redefine and introduce some notations to establish this result. Given a fc-uple of 
points u = (ui, • • • , Uk) taken in an arbitrary set U, and a positive kernel k on U x U we 
write K K (u) for the k x k Gram matrix 

K"(u)= f [KfaUj)]^^. 

For two lists u and u', we write u • u' for the concatenation of u and u'. Recall that 
an empirical measure /i on a measurable set X is a finite sum of weighted Dirac masses, 
(j, = X^=i^"i' where the U{ E X are the locations and the U E M + the weights of such 
masses. 

Lemma 3. For two empirical measures \i and ^ defined on a set X by locations u = 
(u\, • • • , Uk) and v! = (u^, • • • ,u[) and weights t = {t\, ■ ■ ■ ,tf,) E and t' = (i^, . . . , t[) E 
M} + respectively, the function 

£ : (p, m') log \K K {u ■ v!) diag(t • t') + I k+l \ 



is a negative definite kernel. 



Proof. We follow the approach of the proof of Cuturi et al. |2005l . Theorem 7]. Consider m 

m 



measures fj,\, ■ ■ ■ , fi m and m real weights c\, ■ ■ ■ ,c m such that Y^iLi °i = 0- We prove that 
the quantity 

m 

r,r ; £(//,.// ; ;. (9) 

is necessarily non-positive. Consider the finite set S of all locations in X enumerated in 
all measures fi{. For each point u in S, we consider the function k(u,-) in the reproducing 

kernel Hilbert space H of functions defined by k. Let Hs = f span{K(u), u E 5} be the finite 
dimensional subspace of % spanned by all images in T~L of elements of S by this mapping. For 
each empirical measures /i, we consider its counterpart Vi, the empirical measure in Hs with 
the same weights and locations defined as the mapped locations of \ii in Hs- Since for two 
points u\,U2 in S we have by the reproducing property that («(iti, •), k(u2, •)) = k(u\,U2), 
we obtain that = — | logr(vj, Uj) where r is in this case cast as a positive definite 

kernel on the Euclidean space %s- Hence the left hand side of Equation @ is nonnegative 
by negative definiteness of — \ logr. ■ 

We now consider two time series x and x' taking values in an arbitrary space X. For any 
sequence x = (x%, • • • , x n ) we write x^ where 1 < i < j < n for the sequence (xi, X{+i, • • • , Xj). 
To summarize the transitions enumerated in x and x' we consider the sequences of subse- 
quences 

v — f Y P . . . Y n_1 1 Y' — <V P y /p+1 • • • Y /n ' _1 ^ 

A — l x li x 2 ! ' x n-p+lJ' A — l x l' x 2 J ; x n'-p+lJ> 

and 

Y = (xp+i, • • • , x n ), Y = (Xp_^i, • • • , X n i). 
Considering now a p.d. kernel K\ on X p and «2 on X we can build Gram matrices, 

Ki = K Kl (X -X'), k 2 = K K2 (y -y'). 
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Theorem 3. Given two time series x, a/ in X , the autoregressive negative definite kernel 
ip K of order p, parameter < a < 1 and base kernels n\ and k 2 defined as 

^(x,x') = C riX + /(K 1) K 1 + K 2 ), 

is negative definite. 

Proof. <p K is negative definite as the sum of three negative definite kernels: C n ^ n i is an additive 
function in n and n' and is thus trivially negative definite. The term (1 — a) log |i£x + A _1 | 
can be cast as (1 — a) times the negative definite kernel £ defined on measures of X p with 
kernel K\ while the term a log |ifx + Ky + A _1 | is a times the negative definite kernel £ 
defined on measures of X p x X with kernel K\ + k 2 . I 

3.2 Approximations Using Low-Rank Factorizations 

We consider in this section matrix factorization techniques to approximate the kernel matrices 
Ki and Ki + K 2 used to compute (f K (x, x') by low rank matrices. Theorem [5] provides a 
useful tool to control the tradeoff between the accuracy and the computational speed of this 
approximation . 



3.2.1 Computing / using low-rank matrices 



Consider an N x mi matrix gi and an N x m 2 matrix g 2 such that Gi = f gig^f approximates 

Ki and 
differences 



Ki and G 2 d = g 2 g 2 " approximates Ki + K 2 . Namely, such that the Frobenius norms of the 



ei^Ki-Gi, £ 2 d =Ki + K 2 -G 2) 

are small, where the Frobenius norm of a matrix M is ||Af || = Vtr M T M. 

Computing f{G\, G 2 ) requires an order of 0(N(nii +m 2 ) 2 +m\ +m 2 ) operations. Tech- 
niques to obtain such matrices gi and g 2 range from standard truncated ei genvalue decompo- 



sitions, such as the power met hod, to incomplete Ch olesky decompositions [Fine and Scheinbcrg, 



2002, Ba ch and Jordan! . 120051 ] and Nystrm methods [Williams and Seegerl . l200ll . lDrineas and Mahoneyi . 



2005f | which are arguably the most popular in the kernel methods literature. The analysis we 



propose below is valid for any factorization method. 

Proposition 4. Let < a < 1, then f defined in Equation ^ is a strictly concave function 
of (S^r) 2 which is strictly increasing in the sense that f{Q\,R\) < /(Q 2 ,.R 2 ) if Q2 >~ Qi an d 

R2 y R\. 

Proof. The gradient of g : Q h> log|Q + A" 1 ) is Vg(Q) = (Q + A -1 ) -1 which is thus a 
positive definite matrix. As a consequence, / is a strictly increasing function. The Jacobian 
of this gradient evaluated at Q is the linear map e G 1— > — ti(Q + A— l)^ 1 e{Q + A -1 ) -1 . 
For any matrix C y the Hessian of g computed at Q is thus the quadratic form 



V 2 g(Q) : (e, u) -»■ - tr(Q + A^y^iQ + A" 1 )" 1 



v 



Since tr UVUV = ti((VUVVU) 2 ) > for any two matrices U, V y 0, V 2 g(Q)(e, e) is negative 
for any positive definite matrix e. Hence the Hessian of / is minus a positive definite quadratic 
form on (S^) 2 and thus / is strictly concave. ■ 
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We use a first order argument to bound the difference between the approximation and 
the true value of /(Ki,Ki + K2) using terms in [|ei[| and [ | era ] | : 

/(Ki,Ki + K 2 ) - (1 - a)(V<7(Gi),ei) - a(Vg(G 2 ),e 2 } 

< /(Gi, G 2 ) < /(Kx, Ki + K 2 ). 

Theorem 5. Given two time series x,x' ; for any low rank approximations G\ and G 2 in 
such that G\ X Ki and G 2 di Ki + K 2 we have that 

e -tp K (x,x.') < e -C nn ,-f(G 1 ,G 2 ) < Q _|_ p)g-¥'«( x , x ') ) 

w/jere /=exp((l - a)|| V^G^H ||ei || + a||Vg(G 2 )|||MI) - 1. 
Proof. Immediate given that / is concave and increasing ■. 

3.2.2 Early stopping criterion 

Incomplete Cholesky decomposition and Nystrm methods can build iteratively a series of 

matrices gi jt and g 2)i G R Nxt , 1 < t < N such that G ljt = g ljt g^ t and G 2 ,t = f g 2 ,tg^ 4 
increase respectively towards K4 and K4 + K 2 as t goes to N. The series gi jt and g 2jt can 
be obtained without having to compute explicitly the whole of Ki nor Ki + K 2 except for 
their diagonal. 

The iterative computations of Git an d G 2 ,t can be halted whenever an upper bound for 

each of the norms ||ei,t|| and ||e 2) t|| of the residues = f Ki — Gi^ and e 2 j = f Ki + K 2 — G 2 ^ 
goes below an approximation threshold. 

Theorem [5] can be used to produce such a stopping criterion by a rule which combines an 
upper bound on ||£i,t|| and ||e 2 ,t|| and the exact norm of the gradients of g at G^t and G 2 ,t. 
This would require computing Frobenius norms of the matrices (G^t + A -1 ) -1 , i = 1,2. 
These matrices can be updated iteratively using rank-one updates. A simpler alternative 
which we consider is to bound [|V<7(Git)|| uniformly between and K using the inequality 

(G ilt + A" 1 )" 1 r< A,i = 1, 2. 

which yields the following bound: 

e -Mx,x') < e -C Bin /-/(Gi,Ga) < e -M*,*) e ?>J (n-PKn'-P) 

We consider in the experimental section the positive definite kernel e~ t<fR , that is the scaled 
exponentiation of (p K multiplied by a bandwidth parameter t > 0. Setting a target tolerance 
a > on the ratio between the approximation of e _t ^ K and its true value, namely requiring 
that 

e -^ K (x,x')) < e -t(C nin ,+/(Gi,G a )) < (1 +CT ) e ~V K (x,x')) ) 
can be ensured by stopping the factorizations at an iteration t such that 

n n <, 21og(l + r) / (n-p){n' -p) 
(1 - a)||ei jt || + a||e 2 ,t|| < y ^ • 
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which we simplify to performing the factorizations separately, and stopping at the lowest 
iterations t\ and t% such that 



.. .. log(l+T) (n—p)(n'—p) 

"' Ml " " (l-a)tV N (10) 

log(l + r) / (n - p)(n' - p) 
l|e2 ' fel1 " at V N • 

We provide in Figure [2] of Section 14,41 an experimental assessment of this speed/accuracy 
tradeoff when computing the value of ip K . 



4 Experiments 

We provide in this section a fair assessment of the performance and efficiency of autoregressive 
kernels on different tasks. We detail in Section [4.11 the different kernels we consider in this 
benchmark. Section 14.21 and 14.31 introduce the toy and real-life datasets of this benchmark, 
results are presented in Section 14.41 before reaching the conclusion of this paper. 



4.1 Kernels and parameter tuning 

The kernels we consider in this experimental section are all of the form K = e~**, where 
$ is a negative definite kernel. We select for each kernel K the value of the bandwidth 
t as the median value m$ of <5 on all pairs of time series observed in the training fold 
times 0.5, 1 or 2, namely t € {.5m$, m$,2m$}. The selection is based on the cross val- 
idation error on the training fold for each (kernel, dataset) pair. Some kernels described 
below bear the superscript • K , which means that they are parameterized by a base kernel 
k. Given two times series x = (xi,--- ,x n ) and x' = (x[,--- ,x' n ,), this base kernel k is 
used to computed similarities between single components n(xi,Xj) or p-uples of components 
k((xj + i, • • • , Xi +P ) , (x'j +1 , • • • , x'j + )). For all superscripted kernels below, k is set to be the 

Gaussian kernel between two vectors k{x 1 y) = e~^ x ~ y ^ 2 ^ 2a2 \ where the dimension is obvious 
from the context and is either d or pd. The variance parameter a 2 is arbitrarily set to be the 
median value of all Euclidean distances \\x^ — where i < |x( r )|, j < |x( s )|, (r,s) £ 1Z 2 , 
where 1Z is a random subset of {1, 2, • • • , ^training points} 



Autoregressive kernels k^, k£ T - We consider the kernels 

U — p-tzr <p I.K _ p -tar <Pk 



parameterized by the bandwidths t ar and £" r . We set parameters a and p as a = 1/2 and p = 5 
in all experiments. The matrix factorizations used to compute approximat ions to k* r ( with r 
set to 10 -4 ) can be performed using the chol_gauss routine proposed by Shen et al. 20091 ], 



Since running separately this routine for Ki and Ki + K2 results in duplicate computations 
of portions of Ki, we have added our modifications to this routine in order to cache values of 
Ki that can be reused when evaluating Ki + K2. The routine TwoCholGauss is available on 
our website, as well as other pieces of code. We insist on the fact that, other than a,p and 
the temperature t ar , the autoregressive kernel k ar does not require any parameter tuning. 
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Bag of vectors kernel &BoV : time series (sci,-- - , x n ) can be considered as a bag of 
vectors {xi,--- ,x n } where the time-dependent information from each state's timestamp is 
deliberately ignored. Two time series x and x' can be compared through their respective 
bags {xi, • • • , x n } and {x[, • • • , x' n ,} by setting 



^ K (x,x' 



/\ 45f l 



i<n,j<n' 



and defining the kernel fcg pV (x, x') = exp (— fopy (^ .(x, x) -(-^(x^x') — 2^ K (x, x')) where 



^BoV > 0, see for instance [Hein and Bousquetl . 120051 ] . This relatively simple kernel will act 
as the baseline of our experiments, both for performance and computational time. 



Global alignment kernel &q A : The global alignment kernel Cuturi et all 120071 ] is a 
positive definite kernel that builds upon the dynamic time warping framework, by considering 
the soft-maximum of the alignment score of all possible alignments between two time series. 
We use an implementation of this kernel distributed on the web, and consider the kernel 
^GA = exp(— tGAGlobalAlignment K (x, x')), parameterized by the bandwidth £qa- Note that 
the global alignment kernel has not been proved to be infinitely divisible. Namely, is 
known to be positive definite for icA = 1> and as a consequence for tc\ E IN, but not for all 
positive values. However, the Gram matrices that were generated in these experiments have 
been found to be positive definite for all values of icA as discussed earlier in this section. 
This suggests that /cq A might indeed be infinitely divisible. 



Splines Smoothing kernel kg: Kumar a et al. 2008] use spline smoothing techniques 
to map each time series (x%, ■ ■ ■ ,x n ) onto a multivariate polynomial function p x defined on 
[0, 1]. As a pre-processing step, each time series is mapped onto a multivariate time series of 
arbitrary length x (set to 200 in our experiments) such that x T x' corresponds to a relevant 
dot-product for these polynomials. We hav e modified an i mplem entation that we received 
from the authors in email correspondence. iKumara et al.l 20081 ] consider a linear kernel in 
their original paper on such representations. We have found that a Gaussian kernel between 
these two vector representations performs better and use ks = exp(— is|| x 



Remark 5. Although promising, the kernel proposed bv \Jebara et al\ \2004 - §^.5/ is embry- 
onic and leaves many open questions on its practical implementation. A simple implementa- 
tion using VAR models would not work with these experiments, since for many datasets the 
dimension d of the considered time series is comparable to or larger than their lengths ' and 
would prevent any estimation of the (pd 2 + d(d + l)/2) parameters of a VAR(p) model. A 
more advanced implementation not detailed in the original paper would be beyond the scope 
of this work. We have a l so tri ed to implement the fairly complex families of kernels described 
by I Vishwanathan et al\ [20071 ]. namely Equations (10) and (16) in that reference, but our 
implementations performed very poorly on the datasets we considered, and we hence decided 
not to report these results out of concerns for the validity of our codes. Despite repeated 
attempts, we co uld not obtain co mputer codes for these kernels from the authors, a problem 
also reported in fLin et all [jooiy . 



4.2 Toy dataset 

We study the performance of these kernels in a simple binary classification toy experiment 
that illustrates some of the merits of autoregressive kernels. We consider high dimensional 
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time series (d = 1000) dimensional of short length (n = 10) generated randomly using one of 
two VAR(l) models, 

x t +i = AiX t + e t , i = 1,2, 

where the process Et is a white Gaussian noise with covariance matrix .1/iqoo- Each time 
series' initial point is a random vector whose components are each distributed randomly 
following the uniform distribution in [—5,5]. The two matrices Ai, i = 1, 2, are sparse (10% 
of non-zero values, that is 100.000 non zero entries out of potentially one million) and have 
entries that are randomly distributed following a standard Gaussian law 2 . These matrices 
are divided by their spectral radius to ensure that their largest eigenvalue has norm smaller 
than one to ensure their stationarity. 

We draw randomly 10 time series with transition matrix A\ and 10 times with transition 
matrix A2 and use these 20 time series as a training set to learn an SVM that can discriminate 
between time series of type 1 or 2. We draw 100 test time series for each class % = 1, 2, that 
is a total of 200 time series, and test the performance of all kernels following the protocol 
outlined above. The test error is represented in the leftmost bar plot of Figure [TJ The 
autoregressive kernel k av achieves a remarkable test error of 0, whereas other kernels, including 
k£ T , make a not-so-surprisingly larger number of mistakes, given the difficulty of this task. 
One of the strongest appeals of the autoregressive kernel fe ar is that it manages to quantify 
a dynamic similarity between two time series (something that neither the Kumara kernel or 
any kernel based on alignments may achieve with so few samples) without resorting to the 
actual estimation of a density, which would of course be impossible given the samples' length. 



4.3 Real-life datasets 

We assess the performance of the kernels proposed in this paper using different benchmark 
datasets and other known ke rnels for time series. The datasets are all taken from the UCI 
Machine Learning repository Frank and Asuncion , 20ld | , except for the PEMS dataset which 



we have compiled. The datasets characteristics' are summarized in Table HJ 

Japanese Vowels: The database records utterances by nine male speakers of two 
Japanese vowels 'a' and 'e' successively. Each utterance is described as a time series of 
LPC cepstrum coefficients. The length of each time series lies within a range of 7 to 29 
observations, each observation being a vector of M 12 . The task is to guess which of the nine 
speakers pronounces a new utterance of 'a' or 'e'. We use the original split proposed by the 
authors, namely 270 utterances for training and 370 for testing. 

Libras Movement Data Set: LIBRAS is the acronym for the brazilian sign language. 
The observations are 2-dimensional time series of length 45. Each time series describes the 
location of the gravity center of a hand's coordinates in the visual plane. 15 different signs 
are considered, the training set has 24 instances of each class, for a total 360 = 24 x 15 time 
series. We consider another dataset of 585 time series for the test set. 

Handwritten characters: 2858 recordings of a pen tip trajectory were taken from the 
same writer. Each trajectory, a 3 x n matrix where n varies between 60 and 182 records the 
location of the pen and its tip force. Each trajectory belongs to one out of 20 different classes. 
The data is split into 2 balanced folds of 600 examples for training and 2258 examples for 
testing. 

Australian Language of Signs: Sensors are set on the two hands of a native signer 
communicating with the AUSLAN sign language. There are 11 sensors on each hand and 



2 In Matlab notation, A = sprandn(1000, .1) 
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Database 


d 


n 


classes 


# train 


#test 


Toy dataset 


1000 


10 


2 


20 


200 


Japanese Vowels 


12 


7-29 


9 


270 


370 


Libras 


2 


45 


15 


360 


585 


Handwritten Characters 


3 


60-182 


20 


600 


2258 


AUSLAN 


22 


45-136 


95 


600 


1865 


PEMS 


963 


144 


7 


267 


173 



Table 1: Characteristics of the different databases considered in the benchmark test 



hence 22 coordinates for each observation of the time series. The length of each time series 
ranges from 45 to 136 measurements. A sample of 27 distinct recordings is performed for each 
of the 95 considered signs, which totals 2565 time series. These are split between balanced 
train and tests sets of size 600 and 1865 respectively. Each time series in both test and 
training sets is centered individually, that is xx' is replaced by — x«, Without such a 
centering the performance of all kernels is seriously degraded, except for the autoregressive 
kernel which remains very competitive with an error below 10%. 

PEMS Database: We have downloaded 15 months worth of daily data from the Cali- 
fornia Department of Transportation PEMS website 3 . The data describes measurements at 
10 minute intervals of occupancy rate, between and 1, of different car lanes of the San 
Francisco bay area (D04) freeway system. The measurements cover the period from January 
1st 2008 to March 30th 2009. We consider each day of measurements as a single time se- 
ries of dimension 963 (the number of sensors which functioned consistently throughout the 
studied period) and length 6 x 24 = 144. The task is to classify each day as the correct day 
of the week, from Monday to Sunday, e.g. label it with an integer between 1 and 7. We 
remove public holidays from the dataset, as well as two days with anomalies (March 8 2009 
and March 9 2008) where all sensors have been seemingly turned off between 2:00 and 3:00 
AM. This leaves 440 time series in total, which are shuffled and split between 267 training 
observations and 173 test observations. We plan to donate this dataset to the UCI repository, 
and it should be available shortly. In the meantime, the dataset can be accessed in Matlab 
format on our website. 

4.4 Results and computational speed 

The kernels introduced in the Section above are paired with a standard multiclass SVM 
implementation using a one-versus-rest approach. For each kernel and training set pair, the 
SVM constant C to be used on the test set was chosen as either 1, 10 or 100, whichever 
gave the lowest cross-validation mean-error on the training fold. We report the test errors in 
Figure [TJ The errors on the test sets can be also compared with the average computation time 
per kernel evaluation graph displayed for 4 datasets in Figura21 In terms of performance, the 
autoregressive kernels perform favorably with respect to other kernels, notably the Global 
Alignment kernel, which is usually very difficult to beat. Their computational time offer a 
diametrically opposed perspective since for these benchmarks datasets the flexibility of using 
a kernel k to encode the inputs in a RKHS does not yield practical gains in performance 
but has a tremendously high computational price. On the contrary, k ar is both efficient and 
usually very fast compared to the other kernels. 

' ? http : //perns .dot . ca. gov 
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Figure 1: Test error of the 5 considered kernels on 5 different tasks split into two panels 
for better legibility of the error rates (notice the difference in scale) . The AR kernel has a 
test-error of on the toy dataset's test fold. 



4.5 Conclusion and Discussion 



We have proposed in this work two infinitely divisible kernels fc ar and k ai for time series. These 
kernels can be used within the framework of kernel machines, e.g. the SVM or kernel-PCA, or 
more generally as Hilbertian distances by using directly their logarithms (p var and c/?^ ar once 
properly normalized. The first kernel k ar computes a similarity between two multivariate 
time series with a low computational cost. This similarity is easy to implement, easy to 
tune given its infinite divisibility, and often performs similarly or better than more costly 
alternatives. The second kernel, k* T , is a generalization of k ar that can handle structured data 
by considering a local kernel K on the structures. Its computation requires the computation 
of all or a part of large Gram matrices as well as the determinant of these. Given its 
computational drawbacks, the experimental evide nce gathered in t his paper is not sufficient 



to advocate its use on vectorial data. Moreover, lEl Karouil [20101 ] has recently shown that 
the spectrum of a Gram matrix of high-dimensional points using the Gaussian kernel may, 
under certain assumptions, be very similar to the spectrum of the standard Gram matrix of 
these same points using the linear dot-product. In such a case, the sophistication brought 
forward by k^ might be gratuitous and yield similar results to the direct use of fc ar . However, 
we believe that k* r may prove particularly useful when considering time series of structured 
data. For instance, we plan to apply the kernel k ai to the classification of video segments, 
where each segment would be represented as a time varying histogram of features and k a 
suitable kernel on histograms that can take into account the similarity of features themselves. 



Our contribution follows the blueprint laid down by lSeegerl 2002] which can be effectively 
applied to other exponential models. However, we believe that the infinite divisibility of such 
kernels, which is crucial in practical applications, had not been considered before this work. 
Our result in this respect is not as general as we would wish for, in the sense that we do not 
know whether k ar remains infinitely divisible when the degrees of freedom A of the inverse 
Wishart prior exceed d — 1. In such a case, the concavity of / in Equation ([8]) would not 
be given either. Finally, although the prior that we use to define k ar is non informative, it 
might be of interest to learn the hyperparameters for these priors based on a data corpus of 
interest. 
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Handwritten Dataset 




AUSLAN Dataset 




Figure 2: These graphs provide on the left side the average time needed to compute one 
evaluation of each of the 5 kernels on the largest datasets. The average speeds (computed 
over a sample of 50 x 50 calculations) for each kernel are quantified in the y-axis. The x-axis 
is only effective for the kernel k£ r and shows the influence of the accuracy para meter r 
on that speed using the low-rank matrix factorization expression used for tp as described 
in Equation (TiT)|) . This parameter is set between 10° (poor approximation) to 10~ 7 (high 
accuracy) . The accuracy is measured by the maximum norm and the Frobenius norm of the 
difference between the two 50 x 50 </5-Gram matrices. Note that &q A is fully implemented in 
mex-C code, k£ r uses mex-C subroutines for Cholesky decomposition while all other kernels 
are implemented using standard algebra in Matlab. Preprocessing times are not counted in 
these averages. The simulations were run using an iMac 2.66 GhZ Intel Core with 4Gb of 
memory. 
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